Abstract


Clustering is one of the most significant data mining techniques. With the expansion and digitalization of every field, large datasets are being
generated rapidly. Clustering such large datasets is a challenge for traditional sequential clustering algorithms because of their long processing time.
Distributed parallel architectures and algorithms therefore help to meet the performance and scalability requirements of clustering large datasets.
In this study, we design a parallel k-means algorithm using the MapReduce programming model, evaluate it experimentally, and compare the results with
sequential k-means for clustering document datasets of varying size. The results demonstrate that the proposed k-means outperforms
sequential k-means in execution time when clustering documents.
1. Introduction


Data mining is the task of extracting knowledge from a dataset
by discovering useful patterns in the data [1].
Clustering is one of the major research fields in the wide area 
of data mining and analysis. Clustering partitions the data 
objects of a dataset into a number of groups or subsets such 
that objects in a particular subset are similar to each other and 
comparatively dissimilar from objects from other subsets. 
There are many applications of the clustering problems in 
different fields such as image analyses, social science, web 
technology, pattern recognition, telecommunications etc. [2]. 
Document clustering produces clusters or groups of documents
such that documents in a cluster share similar properties
in comparison to documents in other clusters [3]. Documents
are grouped according to the occurrence of each word
in the document files of the dataset. Thus, the clustering job is
to determine the groups that share many of the same words. This
is achieved by using a similarity measure, which lies at the core
of the clustering algorithm. Document clustering is used for various
purposes such as document organization, document
browsing, automatic hierarchical representation of documents,
information filtering, search engine result generation, keyword
extraction and information retrieval [4,5].

The most popular clustering algorithm is k-means because
of its simplicity and efficiency [6]. The ICDM conference ranked it
second among the top 10 data mining algorithms [7]. The k-means
algorithm groups N objects into K clusters while maintaining high
intra-group similarity and low inter-group similarity of the objects.
Initially provided with k cluster centers (centroids), the k-means
algorithm places a data object into a group by finding the closest
cluster center using a distance measure. This measure calculates
the distance between each data object and each centroid.
Among the many available distance measures, Euclidean distance
is the most widely used for calculating the distance between
objects and centroids [8]. K-means is an iterative algorithm. It
stops after a fixed number of iterations or when a predefined
convergence condition is met. Each iteration assigns every object
to the centroid to which its distance is lowest. A new
centroid is then calculated as the mean of the data objects
of each group. This new centroid is fed to the next iteration, and
so on. The time complexity of k-means is O(nkt), where n is the
number of objects, k is the number of clusters and t is the
number of iterations [6].
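The iteration described above can be illustrated with a minimal sequential sketch in Python. It is not the implementation used in this study: the function names, the `seed` and `max_iter` parameters, and the choice to keep the old centroid of an empty cluster are assumptions made for illustration only.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans(objects, k, max_iter=100, seed=0):
    """Sequential k-means sketch; returns (centroids, labels)."""
    rng = random.Random(seed)  # 'seed' is an illustrative parameter
    # Initial centroids: k objects drawn at random from the dataset.
    centroids = [list(map(float, p)) for p in rng.sample(objects, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: n * k distance computations per iteration.
        labels = [min(range(k), key=lambda j: euclidean(o, centroids[j]))
                  for o in objects]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for j in range(k):
            members = [o for o, lab in zip(objects, labels) if lab == j]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[j])  # assumption: keep an empty cluster's centroid
        if updated == centroids:  # convergence: centroids unchanged
            break
        centroids = updated
    return centroids, labels
```

With t iterations, the assignment step performs n * k distance computations each time, which is where the O(nkt) cost comes from.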

The rapid development and full digitalization of many
sectors has become inevitable and essential for their survival.
It is expected that in the near future the trend of digitization will
cover every area of technology. Dataset sizes are increasing at
an explosive rate due to the digitalization of fields such as
scientific laboratories, industrial manufacturing, enterprise
planning and management, chemicals and pharmaceuticals,
finance and insurance, mining, health care, construction,
communication, agriculture, education and entertainment, to
name a few. This explosive growth in data size makes the
performance of existing clustering algorithms inadequate [9].
The poor performance of classical clustering algorithms
causes enterprises to lose time and also marketing
advantage [10]. This situation requires an efficient
computing platform that can exploit the potential of such data
to the maximum, keeping in mind the challenges associated
with it. Classical clustering algorithms therefore need to be
adapted to efficient computing platforms designed for
large-scale data processing. Enterprises have found that existing
centralized architectures need to be replaced with
distributed architectures in order to handle huge volumes
of data efficiently [11]. Apache Hadoop is arguably the most
influential, established and efficient distributed computing
framework for large-scale data processing [12]. Hadoop has two key
components: HDFS (Hadoop Distributed File System) and
MapReduce (a distributed programming model). When a computation
is distributed, HDFS is responsible for dividing the dataset,
sending the parts to multiple computing nodes and
keeping track of them, whereas MapReduce processes the algorithmic
steps on each computing node.

In this study, we propose a parallelized k-means algorithm
using the MapReduce programming model and execute it on
top of Hadoop. The classical and the proposed k-means are
compared on execution time for different data sizes.

The rest of the paper is organized as follows. Section 2
gives a brief overview of several modifications of k-means
proposed in the literature. Section 3 describes our
document clustering study in MapReduce, focusing
on the proposed algorithm. Section 4 presents the
experimental observations and their analysis. The last section
concludes our work.


2. Literature survey


Its simplicity and efficiency make k-means one of the
most popular clustering algorithms. It takes the number of clusters
k as an input prior to clustering. It selects k objects from the
dataset as the initial centroids and then calculates the distance from each
object to the centroids using Euclidean distance. Each centroid
together with the objects closest to it forms a cluster. This
process is executed iteratively until it reaches a fixed number of
iterations or satisfies a clustering criterion [13]. Traditional k-means
has been used extensively for document clustering in several works,
with good results [33-35].

Data clustering algorithms become expensive and slow when
dealing with large data repositories. The large volume of such
repositories makes analytical, retrieval and processing
operations time consuming and difficult [14]. It is
thus necessary to develop efficient, fast and scalable
parallel clustering algorithms. A parallel and distributed
computing platform is also required for dealing with large
volumes of data and for executing parallel and distributed clustering
algorithms efficiently.

Recently, many works have modified k-means in order to
implement it on different platforms for clustering large datasets
efficiently. Some works have modified the k-means algorithm for
sequential execution, and others have modified it to
execute on different parallel and distributed platforms.
Some of these modifications are briefly reviewed below.

K-means clustering is sensitive to the random selection of
initial cluster centers. Generally, the clustering result depends
on the initial centroid values, but there are no formal rules for
selecting a good set of initial centroids. Instead of running k-means with
random centroids, Bradley and Fayyad [15] choose the initial
centroids based on a preliminary run of k-means, which
provides a better clustering result. In [16], the authors
determine a suitable value of k through the
integration (merging) and splitting processes of the proposed ISODATA
technique. The work is effective, but it requires an additional
user-supplied threshold for specifying the number of
processes.

Similarly, Arthur and Vassilvitskii [17] proposed k-means++,
a seeding method that intelligently
chooses cluster centers from a dataset. It reduces the possibility
of a poor clustering outcome due to the initial centroid values.
Kanungo et al. [18] modified k-means using a k-dimensional tree
(k-d tree) data structure, which makes the execution of each step of
k-means faster. A k-d tree is a data structure used for organizing a
set of points in a space with k dimensions. The k-d tree first
stores all data objects, and a subset of candidate centers is maintained
for each node. The candidate centers are filtered as they are passed to
the node's children. The tree saves time as it does not require
updating at each iteration. In order to
speed up k-means, Frahling et al. [19] use coresets to
quickly find groups of similar points for more than one
value of K. The coreset points represent a weighted version of the
original points. This is useful when the user does not know the correct
value of K prior to the execution of k-means. K-means is
modified in [20] to reduce its noise sensitivity
by using the Minkowski metric. It assigns feature weights for each
cluster using the Minkowski distance measure. In this modified
k-means, the feature weights in the Minkowski distance act like
feature rescaling factors in classical k-means. Likas et al. [21]
proposed a work in which several k-means processes are maintained
for numerous clusters. The authors suggested modifying
k-means so that it can execute in parallel in order to obtain
efficiency.
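As an illustration of the seeding idea behind k-means++ mentioned above, the following Python sketch picks the first center uniformly at random and each further center with probability proportional to its squared distance from the nearest center already chosen. It is a simplified sketch of that published idea, not code from any of the cited works; the function names and the `seed` parameter are hypothetical.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_pp_seeds(objects, k, seed=0):
    """Choose k initial centers with D^2-weighted sampling (k-means++ style)."""
    rng = random.Random(seed)  # 'seed' is an illustrative parameter
    centers = [list(rng.choice(objects))]
    while len(centers) < k:
        # Weight of each object: squared distance to its nearest chosen center.
        weights = [min(squared_distance(o, c) for c in centers) for o in objects]
        if sum(weights) == 0:
            # Every object coincides with a chosen center; fall back to uniform choice.
            centers.append(list(rng.choice(objects)))
        else:
            centers.append(list(rng.choices(objects, weights=weights, k=1)[0]))
    return centers
```

Objects that coincide with an already chosen center get weight zero, so distant objects are favoured as new centers.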

Zhang and Forman [22] observe that clustering large
datasets requires clustering algorithms to be parallelized. They
proposed a parallel k-means in order to cluster large datasets
efficiently. A performance function is designed
which takes the N input data objects and the locations of the K
cluster centers and estimates a value M. The input data objects are
distributed among a set of computers (C). Based on the value M
estimated by the performance function, K cluster centers are selected
and consistent copies are kept on each computer. Each computer then
iteratively computes its part of the global statistics S for N, which
provides information to the performance function. After summation
across processors, the value of S is broadcast to all
computers. Based on the value of S, each computer settles on a new
set of centroids. A downside of this method is that the computation
and storage required at the central site for data duplication from/to
remote sites slows down the entire process.

K-means can also be parallelized using other parallel
programming models such as OpenMP and MPI (Message
Passing Interface) in order to gain efficiency. MPI provides a
message passing programming model for parallel architectures
and is very helpful in creating efficient and
scalable parallel applications. In [6], k-means is adapted to the
MPI model by adding a merging algorithm. Data
parallelism is achieved by evenly distributing the data
objects to all processes and replicating the centroids. After each
iteration, the global operation on the centroids is performed by a
merging algorithm that greedily combines the centroid sets
generated by each process into a final centroid set.
The algorithm outputs the K centroids, the I/O execution time and
the clustering time. The work is stable and efficient for large
datasets, but it does not provide any theoretical analysis of
performance for a varying number of processes. Similarly, in
[23], the MPI programming model is used to parallelize k-means
by dividing the algorithmic work among processors that communicate
through distributed memory in order to speed up
execution. It is found that the communication costs
become insignificant relative to the overall clustering time as the
number of data objects increases.

K-means has been tested on different parallelization
platforms in order to compare their efficiency. In
[24], the authors implemented k-means in different parallel
frameworks, namely OpenMP, MPI and CUDA C, and compared
their performance. It is observed that OpenMP performs best for
small datasets, whereas CUDA works well with large
datasets. Similarly, in [25], a parallel version of k-means is
executed on OpenMP, MPI and CUDA; the results show that
parallel k-means performs much better than
sequential execution and that performance varies with the combination
of platform and hardware. In [26], the author experimented with
and compared three standard frameworks for parallelization:
MPI, OpenMP and MapReduce. It is observed that OpenMP is the best
choice for small datasets when sufficient processor cores and memory
are available, whereas MPI is a good choice for moderate data
sizes. It is also stated that for large real-world
datasets, MapReduce is more efficient than MPI and
OpenMP. The most time-consuming part of k-means is the
iterative distance computation. The authors of [27] observe that
optimizing this iterative part of the algorithm is the key
to gaining efficiency through a parallel implementation.
The reasons for using MapReduce on top of the
Hadoop distributed architecture to modify k-means are
given below:


• OpenMP and MPI are found to be efficient in clustering only
when the size of the dataset is small or moderate. The
MapReduce model can achieve a major performance gain in
processing large datasets [28].

• The distance calculation from objects to centroids is the
iterative part of k-means. Hence, if the distance calculation
is designed in MapReduce and executed in parallel on
Hadoop, the algorithm gains efficiency in
clustering [28].

• Data distribution and result accumulation among the nodes
of a distributed architecture significantly complicate
programming and have an adverse effect on code
efficiency. A distributed architecture like Hadoop, which
manages these operations implicitly, is thus required [29].

• The distributed architecture should be built from nodes
with low-cost commodity hardware so that the algorithm
can be executed anywhere at any time. A Hadoop cluster can be
a good choice [30].

• Scalability is an important requirement in distributed
processing. Hadoop is scalable, as a cluster can
accommodate additional nodes at any time [31].

• Fault tolerance is another important factor in a distributed
framework. Fault tolerance guarantees that if a node fails
while an algorithm is executing, the data lost from that
node can be recovered. Hadoop provides
fault tolerance in case of node failure [32].


3. Methodology 


In this work, we have converted the traditional k-means
algorithm into a parallel k-means using the MapReduce paradigm and
executed it on top of the Hadoop platform to reduce the execution
time of clustering a document dataset. The major
objective of this paper is to measure the clustering efficiency,
in terms of execution time, of the proposed k-means over classical
k-means on document datasets of different sizes. The
contribution of this work is the redesign of traditional k-means as a
MapReduce-based k-means that works on the vector space
model of text datasets for document clustering. We believe
that this design of k-means for document clustering could
become a framework for parallelizing other
clustering algorithms.


A) Hadoop Framework: Hadoop provides a simple but robust
distributed programming framework for processing large
datasets in parallel. It provides distributed storage and
computation through a cluster of commodity computers.
Each node in a cluster contributes its own computing
capacity, and the total capacity of a Hadoop cluster is
determined by the cumulative computing power of the nodes
connected in the cluster. Hadoop works on a master/slave
architecture (see Fig. 1). One node in a cluster is
designated as the master by the user, whereas the other nodes become
slaves.

The user interacts only with the master through a command
interface, and the master node coordinates with the slave nodes to
distribute storage and computation. For instance, the
user stores the dataset on the master node and executes programs
through the master node. It is the responsibility of the
master node to split the dataset across the slaves for
computation and then to accumulate the results of the computation from
the slaves. The storage and processing capacity is provided
by two basic Hadoop modules:

• A Distributed File System: The data storage and
management part of the Hadoop cluster is provided
by the Hadoop Distributed File System (HDFS). Each
node in a cluster runs HDFS. Before a data-intensive
application is executed, HDFS on the master node
divides the dataset and sends the splits to the slaves in
the cluster. HDFS replicates each data split to multiple
nodes so that, if a node fails, the data can
be recovered from another node. Slave nodes can
communicate among themselves to rebalance data,
to move copies around and to maintain data
replication. In this way HDFS provides fault tolerance
and hence reliability of data.

• MapReduce Programming Model: The MapReduce
programming paradigm is used by the Hadoop
architecture to process large datasets across a
number of connected nodes. It provides the data
processing part of Hadoop. The user
submits MapReduce jobs to the master, and the
master transfers each job to the available slave nodes in
the cluster. The MapReduce programming model
works as follows:
The MapReduce model takes a set of <key, value> pairs as input
and outputs a set of <key, value> pairs. Each data
processing algorithm is run on the MapReduce model using two
functions: Map and Reduce. Before data is submitted to the
mapper, it must be transformed into <key, value> pairs,
as the mapper can only deal with this form. When data stored
in HDFS is fed to a job, each record is assigned a key and a value. The
content of a line in the dataset, excluding the line terminator, is
considered the value, and the offset of the beginning of that line in
the dataset is considered the key.
• Map function: It takes an input <key, value> pair
and outputs a set of intermediate <key, value> pairs.
For each intermediate key, all intermediate values
are collected and then provided to the reduce function.
• Reduce function: It receives a key and the set of
values for that key from the mappers and merges these
values to form a possibly smaller set of values.
Generally, each reduce invocation produces zero
or one output value.


A basic MapReduce data processing scheme is depicted in Fig. 2.
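The <key, value> flow described above can be imitated on a single machine with a few lines of Python. This sketch does not use Hadoop and is only meant to show the roles of the map, sort/merge and reduce stages; the word-count task, the function names, and the use of character offsets (rather than byte offsets) as keys are assumptions chosen for illustration.

```python
from collections import defaultdict


def read_records(text):
    """Yield (offset, line): key = offset of the line start, value = line without terminator."""
    offset = 0
    for raw in text.splitlines(keepends=True):
        yield offset, raw.rstrip("\r\n")
        offset += len(raw)  # character offset; an assumption of this sketch


def map_fn(key, value):
    """Map: one input pair -> a set of intermediate (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Sort/merge stage: collect all intermediate values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce: merge the values of one key into a single output value."""
    return key, sum(values)


def run_job(text):
    """Run map, shuffle and reduce in sequence on one machine."""
    intermediate = [pair for k, v in read_records(text) for pair in map_fn(k, v)]
    return [reduce_fn(k, vs) for k, vs in shuffle(intermediate)]
```

For the input "a b\nb c\n", the records are (0, "a b") and (4, "b c"), and the job returns one (word, count) pair per distinct word.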


B) Proposed k-means algorithm execution stages: In this
section we provide the necessary details of parallel
k-means for the MapReduce programming model. Fig. 3
provides the steps used for clustering a voluminous document
dataset using the parallel K-means algorithm based on the
MapReduce paradigm.


Step 1: Preprocess the voluminous document dataset: The
text data of a document needs to be preprocessed and
transformed into a form that k-means can use as an
appropriate input. In the literature, documents are preprocessed
using various techniques such as the vector model, the graphical
model, stemming etc. K-means takes numeric values as clustering
input. Hence, the dataset is processed using the
vector space model, which represents each word
in the dataset by a normalized numeric quantity based on
its occurrences in the dataset. The normalized numeric
quantities are then stored in a file and provided as
input to k-means through HDFS.

[Fig. 1. Hadoop master/slave architecture. Recoverable labels: Master Node; Slave Nodes.]

[Fig. 2. A basic MapReduce data processing scheme. Recoverable labels: Map (): split pair <ki, vi>; sort by each ki; merge pair <ki, [v1, v2...]>; Reduce ().]

The transformed document dataset is a suitable input for
k-means, as it represents the dataset as a set of multidimensional
numeric vectors. Each dimension of such a vector is
the weight of one unique word (term) in the dataset. The
“weight” reflects the relevance of the corresponding term
in the given dataset. If a corpus comprises n terms $t_i$,
where $i = 1 \ldots n$, then a document d from that dataset is
characterized by the vector $d = \{w_1, w_2, \ldots, w_n\}$,
where $w_i$ is the weight linked with term $t_i$. The degree of
relationship among documents can be characterized by the distance
among the vectors of the corresponding documents in the vector
space. The dataset is converted to the vector space model after
applying widely used pre-processing techniques such
as normalizing the text, removing terms with very low or very
high frequency by using Zipf's rule (logarithmic scaling),
removing the so-called stop-words, and reducing words to their
root form through stemming.
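A minimal Python sketch of such a transformation is given below. The exact weighting scheme is not specified here, so the sketch assumes a plain term-frequency weight normalized to unit length; the tiny stop-word list and the function names are hypothetical, and stemming and frequency-based term pruning are omitted for brevity.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lower-case the text, keep alphabetic tokens, drop stop-words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_vectors(documents):
    """Return (vocabulary, vectors); each vector holds one weight w_i per term t_i."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        raw = [c[t] for t in vocabulary]  # term frequency per term
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])  # assumed: unit-length normalization
    return vocabulary, vectors
```

Each document becomes one line of numeric weights, which is the form of input that k-means expects.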

Step 2: Selection of K cluster centroids: K-means requires
the number of output clusters to be supplied before
the algorithm is executed. Hence, we provide the number
of cluster centroids K before execution. The
K initial centroids are randomly generated by the
program and stored in a file named cluster_centroids. These
centroid values are used by the map function in the first
iteration of parallel k-means. Subsequent MapReduce iterations
recalculate the cluster centroids and update the
centroid values, keeping the value of K the same.

[Fig. 3. The stages of our document clustering using parallel K-means. Recoverable labels: Mapper (calculate centroids and objects assigned to each centroid, based on Euclidean distance); Iterative Part.]

[Fig. 4. Parallel K-means iterations in a Hadoop cluster. Recoverable labels: Data Split S1 ... Data Split Sn; Mappers (compute k cluster centroids for split S1); Centroids S1; Reducer (create global centroids by merging centroids from splits S1 ... Sn); Pass new centroids.]

Step 3: MapReduce parallel execution of k-means:
Preprocessing produces a numeric vector
representation of the input document dataset, and the
vectors are stored in a file named input_data. The number of
clusters, k, is then taken as input from the user. The k
vectors that serve as initial cluster centers are taken
randomly from the dataset and stored in a file named
centroid_data. Input_data is split across the slave nodes and
centroid_data is copied to each slave by HDFS. The
parallel execution of our k-means in MapReduce is
conducted on a Hadoop cluster of 10 nodes.


Dividing an algorithm into map and reduce parts is not
straightforward. We observed that
the execution of k-means can be divided into two parts:
first, the parallel and iterative part of calculating the distance
between centroids and dataset objects, which
assigns each object to the nearest centroid; second, the
sequential part of updating the centroids after objects have been
assigned to them in each iteration. Based on this
observation, we designed our parallel k-means such that the map
function assigns each object to the
nearest center, while the reduce function
updates the cluster centers; the two are repeated until the centers
remain unchanged. A pictorial overview is presented in Fig. 4. When the
centroid values remain unchanged, the
clustering job is complete and the result is
displayed.


3.1. Algorithm for mapper

Input: A set of document objects transformed into numeric
vector form O = {o1, o2, ..., on}, and a randomly chosen set of
initial cluster centers C = {c1, c2, ..., ck}

Output: An intermediate <key, value> pair of (cj, oi),
where 1 ≤ i ≤ n and 1 ≤ j ≤ k.

Function:

1. F1 ← {o1, o2, ..., on}
2. existing_centroids ← C
3. Distance (x, y) = $\sqrt{\sum_{i} (x_i - y_i)^2}$, where $x_i$ and $y_i$ are the values of objects x and y in dimension i
4. FOR all oi ∈ F1 such that 1 ≤ i ≤ n do
       bestCentroid = NULL
       dist_min = ∞
5.     FOR all c ∈ C do
           similarity = Distance (oi, c)
6.         IF (bestCentroid = NULL || similarity < dist_min) then
               dist_min ← similarity
               bestCentroid = c
7.         END IF
8.     END FOR
       release (bestCentroid, oi)
       i = i + 1
9. END FOR
10. Return intermediate <key, value> pair
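The division of work described above (assignment in the map function, centroid update in the reduce function, repetition until the centroids stop changing) can be simulated in plain Python without Hadoop. The sketch below is an illustration under stated assumptions, not the Java/Hadoop code of this study: the names `kmeans_map`, `kmeans_reduce` and `run_kmeans` are hypothetical, the index of the nearest centroid is used as the key instead of the centroid vector itself, and a centroid that receives no objects is simply kept.

```python
import math
from collections import defaultdict


def distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans_map(split, centroids):
    """Mapper: emit (index of nearest centroid, object) for each object of one data split."""
    for obj in split:
        best = min(range(len(centroids)), key=lambda j: distance(obj, centroids[j]))
        yield best, obj


def kmeans_reduce(key, objects):
    """Reducer: new centroid = component-wise mean of the objects assigned to 'key'."""
    n = len(objects)
    return key, [sum(col) / n for col in zip(*objects)]


def run_kmeans(splits, centroids, max_iter=50):
    """Driver: repeat map and reduce until the centroids remain unchanged."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        groups = defaultdict(list)
        for split in splits:  # on a cluster, each split would be mapped on a different node
            for key, obj in kmeans_map(split, centroids):
                groups[key].append(obj)
        updated = list(centroids)  # assumption: empty clusters keep their centroid
        for key, objs in groups.items():
            _, updated[key] = kmeans_reduce(key, objs)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

For example, with two splits `[[0, 0], [0, 1]]` and `[[10, 10], [10, 11]]` and initial centroids `[0, 0]` and `[10, 10]`, the loop settles on the centroids `[0.0, 0.5]` and `[10.0, 10.5]`.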


3.2. Algorithm for reducer


Input: <Key, Value> pairs output by the mappers, where
key = bestCentroid and value = the objects assigned to it.

Output: <Key, Value> pair where key and value are
oldCentroid and newCentroid respectively; bestCentroid of the
mapper is used to calculate the value of newCentroid.

Function:

1. oppair = <Key, Value> pairs from mappers
2. Let a = <key, value> pair output from mappers
3. [centroid] = { }
4. newCentroid = NULL
5. FOR all a ∈ oppair do
       oldCentroid ← a.key
       object ← a.value
       [centroid] ← object
6. END FOR
7. FOR all centroid ∈ [centroid] do
       sumofAllObjects = NULL
       numofObjects = 0
       FOR all object ∈ [centroid] do
           sumofAllObjects = sumofAllObjects + object
           numofObjects = numofObjects + 1
8.     END FOR
       newCentroid = (sumofAllObjects / numofObjects)
       release (oldCentroid, newCentroid)
9. END FOR


4. Results and analysis

4.1. Experimental setup


The experiment was carried out on a Hadoop cluster of
ten nodes. Each node in the Hadoop cluster is configured
with an Intel Core 2 Duo CPU @ 2.53 GHz, 8 GB of
DDR3 RAM and an 80 GB hard disk, with a
measured bandwidth for end-to-end TCP sockets of 100 MB/s.
The operating system used is Ubuntu 14.04 LTS and the Hadoop
version is 2.7.2. The dataset used in this experiment consists of
newsgroup documents. It contains news from different
categories such as sport, politics, religion etc. It has 20
directories, each consisting of news of a particular category. The
dataset is a collection of large amounts of unstructured text
data. Datasets of five different sizes are used: 100, 250, 500,
750 and 1024 megabytes.


4.2. Experimental result and its analysis


The variation in dataset sizes helps to evaluate the performance
gain of our proposed algorithm effectively. The
sequential k-means algorithm is also run on the
same datasets and system configuration, and the performance
differences between sequential and proposed k-means are
presented and analyzed.

In Table 1, the execution times observed for the different
datasets are given for the sequential and
parallel execution of k-means. To evaluate the performance of the
proposed k-means, the data scale-up method of evaluation is used.
In our data scale-up experiments, the proposed k-means and
sequential k-means are executed on different
dataset sizes for a fixed Hadoop cluster size of 10 nodes, and the
execution time is recorded for each experiment. The resulting
execution times provide a basis for
analyzing and evaluating the performance differences between
proposed and sequential k-means.


Table 1
Execution time in seconds.

Algorithm                             Observation      Dataset size
                                                       100 MB   250 MB   500 MB   750 MB   1024 MB
Sequential k-means                    Execution time   151      195      520      637      649
                                      Ratio            1        1.3      3.4      4.2      4.3
Proposed k-means in 10 node cluster   Execution time   52       87       110      133      140
                                      Ratio            1        1.7      2.1      2.5      2.7
Proposed: sequential execution time                    1:2.9    1:2.5    1:4.7    1:4.8    1:4.6

The execution time taken by sequential k-means is 151,
195, 520, 637 and 649 s for the 100 MB, 250 MB, 500 MB,
750 MB and 1024 MB datasets, whereas the proposed k-means
took 52, 87, 110, 133 and 140 s respectively. We have
analyzed the execution times obtained from all experiments in
order to assess the performance of the proposed k-means and
compare it with sequential k-means. A ratio is an effective tool
for comparison: it expresses how many times one value contains or is
contained within the other. To analyze the
performance of each algorithm, the execution time for the 100 MB
dataset is taken as the unit and the execution times for the other
datasets are expressed as ratios to it. For
example, the observed execution time of the proposed k-means for the
100 MB dataset, which is 52 s, is considered as the unit, and the
execution times for the other datasets are compared with it and
reported in the table. The ratio of execution time between sequential
and proposed k-means for each dataset size is also
shown in Table 1. The
performance gain of the proposed k-means over sequential
k-means can be read directly from this ratio. For example,
the ratio between proposed and sequential k-means is reported as 1:2.5
for the 250 MB dataset; that is, the
proposed k-means is reported to be two and a half
times faster than sequential k-means. The
execution time of our proposed algorithm is significantly
lower than that of the sequential implementation of k-means. It is
reported to be 2.9, 2.5, 4.7, 4.8 and 4.6 times faster than sequential
k-means for the 100 MB, 250 MB, 500 MB, 750 MB and 1024 MB datasets. It
is also observed that clustering with the proposed algorithm
becomes more efficient as the input dataset becomes larger.
For example, it takes 52 s to cluster the 100 MB dataset, whereas
for the 1024 MB dataset, which is about 10 times larger, it takes only
140 s, which is just 2.7 times longer.
This is due to the fact that MapReduce performs
more efficiently for large datasets. The execution time of the
proposed algorithm is depicted using bar graphs and line
graphs in Figs. 5 and 6 respectively. Similarly, the execution
time of sequential k-means is depicted using bar graphs and
line graphs in Figs. 7 and 8 respectively.

[Fig. 5. Bar graph of proposed K-means execution. Title: 10 Node Data Scale-up; x-axis: Data Size (100MB, 250MB, 500MB, 750MB, 1024MB); y-axis: Time (Seconds).]

[Fig. 6. Line graph of proposed K-means execution. Title: 10 Node Data Scale-up; x-axis: Data Size (100MB, 250MB, 500MB, 750MB, 1024MB); y-axis: Time (Seconds).]
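The ratio arithmetic can be checked with a short Python sketch that uses only the execution times listed in Table 1; the function names are hypothetical. Recomputing from those times reproduces most of the reported ratios. Two entries differ slightly after rounding to one decimal place (about 2.2 rather than 2.5 for the 250 MB speed-up, and 2.6 rather than 2.5 for the 750 MB scale-up ratio of the proposed k-means), which may reflect rounding or transcription differences in the reported table.

```python
SIZES_MB = [100, 250, 500, 750, 1024]
SEQUENTIAL_S = [151, 195, 520, 637, 649]   # execution times from Table 1
PROPOSED_S = [52, 87, 110, 133, 140]       # execution times from Table 1


def scale_up_ratios(times):
    """Execution time of each dataset relative to the first (100 MB) dataset."""
    return [round(t / times[0], 1) for t in times]


def speedups(sequential, proposed):
    """How many times faster the proposed run is than the sequential run."""
    return [round(s / p, 1) for s, p in zip(sequential, proposed)]


if __name__ == "__main__":
    for size, seq, par in zip(SIZES_MB, SEQUENTIAL_S, PROPOSED_S):
        print(f"{size} MB: sequential {seq} s, proposed {par} s, speed-up {seq / par:.1f}x")
```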

It is clearly observed from the bar and line charts that the
increase in execution time with increasing data size
is not uniform, i.e. it cannot be portrayed as a straight line in a
line graph. This uneven increase in execution time with
dataset size occurs in almost all Hadoop clusters, as the nodes in
a cluster have to carry many overheads other than
MapReduce execution. Some of the overheads a cluster has to
bear are the processing overhead of system processes and tools
(such as background antivirus tools) executing with different
priorities, networking overhead between nodes, the replication
factor of Hadoop, HDFS checksums etc.


5. Conclusion and future work 


The MapReduce programming model on a Hadoop cluster is a
recent and popular approach to analyzing large datasets in a short
span of time. It is important to parallelize clustering
algorithms using MapReduce in order to reduce clustering
execution time. This work proposed a parallel
k-means algorithm using MapReduce for document clustering,
and the execution time of the clustering job is compared with
that of the sequential k-means algorithm on datasets of different sizes.
The proposed algorithm is able to cluster datasets in a short span
of time by utilizing a Hadoop cluster of 10 nodes. The
experimental results give us the following insights:


[Fig. 7. Bar graph of sequential K-means execution. Title: Sequential Data Scale-up; x-axis: Data Size (100MB, 250MB, 500MB, 750MB, 1024MB); y-axis: Time (Seconds).]

[Fig. 8. Line graph of sequential K-means execution. Title: Sequential Data Scale-up; x-axis: Data Size (100MB, 250MB, 500MB, 750MB, 1024MB); y-axis: Time (Seconds).]


• Document clustering can be effectively implemented in
MapReduce by writing suitable mapper and reducer parts
for k-means.

• The proposed k-means outperforms
sequential k-means in terms of execution time.

• The proposed algorithm is more efficient when clustering
larger datasets than smaller ones.

• It is observed from the line charts and bar graphs that the
processing time of the Hadoop cluster does not grow uniformly
with dataset size, although the same node configuration
and the same algorithm are used.


In this study, k-means is modified only so that it can
execute on top of Hadoop. The inherent issues of k-means are left
unsolved. Like sequential k-means, the proposed
k-means must be supplied with the number of clusters k
prior to execution. Likewise, the clustering result of the proposed
k-means depends on the initial centroid selection. There is
always room for improvement in research work.
As future work, the proposed algorithm
and framework can be extended with the ideas listed below:


• The proposed k-means can be modified such that it
automatically settles on the number of clusters and
effectively selects initial centroids depending on the dataset.

• The proposed k-means can be combined with techniques like
hierarchical clustering algorithms, swarm intelligence,
fuzzy logic, gravitational search algorithms, neural
networks etc. in order to obtain better quality clustering with
efficiency.

• The proposed algorithm can also be optimized
by tuning the number of mappers and reducers in the code
and/or by adding a combiner between mapper and
reducer.

• Hadoop provides choices to optimize disk, memory,
network and CPU usage. A Hadoop cluster can be optimized for
each particular job for more efficiency.
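As an illustration of the combiner idea mentioned in the list above, the Python sketch below pre-aggregates the objects of each data split into one partial (vector sum, count) pair per centroid before anything is sent to the reducer. It is a possible direction rather than part of the evaluated algorithm; the function names and the use of centroid indices as keys are assumptions.

```python
import math
from collections import defaultdict


def distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def map_with_combiner(split, centroids):
    """Assign the objects of one split and emit one (index, (vector sum, count)) pair per centroid."""
    partial = {}
    for obj in split:
        best = min(range(len(centroids)), key=lambda j: distance(obj, centroids[j]))
        total, count = partial.get(best, ([0.0] * len(obj), 0))
        partial[best] = ([t + v for t, v in zip(total, obj)], count + 1)
    return list(partial.items())


def reduce_partials(pairs):
    """Merge the partial sums of all splits; return {centroid index: new centroid}."""
    sums, counts = {}, defaultdict(int)
    for key, (total, count) in pairs:
        if key in sums:
            sums[key] = [a + b for a, b in zip(sums[key], total)]
        else:
            sums[key] = list(total)
        counts[key] += count
    return {key: [s / counts[key] for s in sums[key]] for key in sums}
```

With this arrangement each split sends at most k pairs to the reducer instead of one pair per object, which is the kind of reduction in intermediate data a combiner is meant to provide.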